Install and load in the libraries and data we need for this section:
# Set your working directory by clicking on the top menu:
# Session > Set Working Directory > To Source File Location
# Install packages
install.packages("dplyr")
# Load in libraries
library(dplyr)
# To read the help page for a function, type one question mark in front of the function name:
?read.csv
# To search the help pages of all installed packages (e.g. to find which package a function belongs to), type two question marks:
??read.csv
# Load in data
raw_data <- read.csv("data/raw_data.csv")

View(): allows us to view the data frame as a spreadsheet.
names(): gives us the names of the columns in our data frame.
dim(): tells us the dimensions (rows and columns) of our data frame.
summary(): gives us summary statistics (counts, min, median, mean, max).
head(): gives us the first 6 rows of the data.
tail(): gives us the last 6 rows of the data.
str(): tells us the type of each variable (e.g. Factor, num (number), int (integer)).
unique(): tells us the unique elements of a variable.
? and ??: open the help file for a function and search the help pages, respectively.
Try out the following commands to get to know the data:
# How many entries does the data frame have?
View(raw_data)
# What are the names of the first 3 columns?
names(raw_data)
# What are the dimensions of our data?
dim(raw_data)
# Which species have more cases? What is the mean age of the organisms infected?
summary(raw_data)
# In which region did the 1st case occur?
head(raw_data)
# In which region did the last case occur?
tail(raw_data)
# Which variables are numbers (num)?
str(raw_data)
# What types of species do we have in the data?
unique(raw_data$species)
# What is the first argument for the function names?
?names()

During this workshop, we will use functions available in the dplyr package to subset and summarise data.
dplyr functions can be chained together with the %>% (pipe) operator, which passes the output of one function directly into the next as its first argument. This makes it possible to ‘stack’ multiple functions without saving each intermediate result as a separate object. You’ll see this in use in the following examples.
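As a minimal sketch of what the pipe does, the two calls below should give the same result. The data frame `toy` and its values are made up for illustration and are not part of the workshop data.

```r
library(dplyr)

# Made-up data for illustration only (not the workshop data)
toy <- data.frame(
  id  = 1:4,
  age = c(12, 45, 30, 8)
)

# Without the pipe: functions are nested, so you read from the inside out
nested <- head(arrange(toy, age), 2)

# With the pipe: the left-hand side becomes the first argument of the next function
piped <- toy %>%
  arrange(age) %>%
  head(2)

# Check that both approaches give the same table
identical(nested, piped)
```

The piped version reads top to bottom in the order the steps are carried out, which tends to be easier to follow once more than two functions are involved.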
Subsetting is commonly used in R to select data that you would like to use. The select function can be used to select columns.
Compare the results of using the function normally vs. with the pipe operator:
select(raw_data, age, region)
raw_data %>%
  select(age, region)

The filter function can be used to select rows based on their values.
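A small sketch of how filter behaves, assuming a made-up data frame called `toy` (hypothetical, not the workshop data): only the rows where the condition is TRUE are kept.

```r
library(dplyr)

# Made-up data for illustration only (not the workshop data)
toy <- data.frame(
  species = c("dog", "cat", "dog", "human"),
  age     = c(2, 5, 7, 40)
)

# Keep rows where species is "dog" AND age is at least 5
dogs_5_plus <- toy %>%
  filter(species == "dog" & age >= 5)

# Keep rows where species matches any value in the list
pets <- toy %>%
  filter(species %in% c("dog", "cat"))

# Compare the number of rows kept by each filter
nrow(dogs_5_plus)
nrow(pets)
```

Conditions can be combined inside a single filter call, so there is usually no need to filter the same table twice.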
To select the observations you want, it is useful to know some comparison operators:
> greater than
>= greater than or equal to
!= not equal
== equal
& and
| or
! not
%in% c(...) one of a list of elements
What does each of these lines of code filter the data for?
raw_data %>%
  filter(region %in% c("Mara", "Pwani", "Dar-es-salaam"))
raw_data %>%
  filter(age >= 30)

These outputs can be saved as an object, exactly as you normally would.
Store your output table as an object:
# Save output as an object
subsetted_data <- raw_data %>%
  filter(region %in% c("Mara", "Pwani", "Dar-es-salaam"))
# Print table
subsetted_data

Sometimes, you may want to work with summaries of your data. The summarise function can be used to calculate summaries of variables in your data.
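As a rough illustration, summarise collapses a table into a single row containing the statistics you ask for. The data frame `toy` and its values below are made up and are not part of the workshop data.

```r
library(dplyr)

# Made-up data for illustration only (not the workshop data)
toy <- data.frame(
  sex = c("M", "F", "M", "F", "M"),
  age = c(10, 20, 30, 40, 50)
)

toy_summary <- toy %>%
  summarise(
    n_records = n(),              # number of rows
    n_males   = sum(sex == "M"),  # number of rows where the condition is TRUE
    mean_age  = mean(age)         # average of the age column
  )

toy_summary
```

Each argument to summarise becomes one column in the output, named by the left-hand side of the `=`.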
What does each of the following lines of code summarise?
raw_data %>%
  summarise(n_males = length(which(sex=="M")))
raw_data %>%
  summarise(total_age = sum(age))

As mentioned earlier, dplyr functions can be stacked using the %>% (pipe) operator. For example, the summarise function can be combined with group_by to summarise variables by one or more columns.
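A minimal sketch of group_by followed by summarise, again on a made-up data frame (`toy` and its region names are hypothetical): the summary is calculated once per group, so the output has one row per region.

```r
library(dplyr)

# Made-up data for illustration only (not the workshop data)
toy <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  age    = c(10, 20, 30, 40, 50)
)

by_region <- toy %>%
  group_by(region) %>%       # split the rows into one group per region
  summarise(
    n_records = n(),         # number of rows in each group
    mean_age  = mean(age)    # mean age within each group
  )

by_region
```

Adding a second column to group_by would give one row per combination of the two columns.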
How are these two tables different?
raw_data %>%
  group_by(sex) %>%
  summarise(n_records = length(sex))
raw_data %>%
  group_by(region, sex) %>%
  summarise(total_age = sum(age))

Fill in the blanks for the following lines in your R script:
# Subset for only records with a dog
raw_data %>%
  ___(species=="dog")
# Subset for humans, and summarise the mean age per region
raw_data %>%
  ___(species=="human") %>%
  group_by(___) %>%
  ___(mean_age = ___(age))

library(ggplot2)
library(lubridate)
library(leaflet)

ggplot() +
  geom_bar(data=raw_data, aes(x=sex), fill=col_palette[1]) +
  theme_classic()

# Set start and end dates for time series
raw_data$date <- as.Date(raw_data$date)
ts_start <- as.Date(paste0(substr(min(raw_data$date),1,7), "-01"))
ts_end <- ceiling_date(max(raw_data$date),'month')
ts_breaks <- seq(ts_start, ts_end, by="month")
# Subset data
male_data <- raw_data %>% filter(sex=="M")
female_data <- raw_data %>% filter(sex=="F")
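# Use histogram function to summarise numbers for each month
# (right=FALSE counts a date falling on the 1st of a month in that month, not the previous one)
ts_male <- hist(male_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)
ts_female <- hist(female_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)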
# Create a data frame containing the time series data
# (drop the last break so that each date marks the start of a monthly bin)
ts_data <- data.frame(date = rep(ts_breaks[-length(ts_breaks)], 2),
                      sex = c(rep("Male", length(ts_male$counts)),
                              rep("Female", length(ts_female$counts))),
                      n = c(ts_male$counts, ts_female$counts))
# Plot
ggplot() +
  geom_col(data=ts_data, aes(x=date, y=n, fill=sex)) +
  labs(x="Date", y="Number") +
  scale_fill_manual(name="Gender", values=col_palette[1:2]) +
  theme_classic()

# Create a new column specifying domestic vs. wildlife vs. human
raw_data$species_type[which(raw_data$species=="dog" | raw_data$species=="cat")] <- "Domestic"
raw_data$species_type[which(raw_data$species=="jackal" | raw_data$species=="lion")] <- "Wildlife"
raw_data$species_type[which(raw_data$species=="human")] <- "Human"
# Use only one year
leaflet_data <- raw_data %>%
  mutate(year = substr(date, 1, 4)) %>%
  filter(year == "2014")
# Setup point colours using the colorFactor() function
leaflet_pal <- colorFactor(palette=col_palette[1:3], domain = unique(leaflet_data$species_type))
# Plot
leaflet() %>%
  addPolygons(data=region_shp, weight=1, color="black", fillColor = "white", fillOpacity=1) %>%
  addCircleMarkers(data=leaflet_data, lng=~x, lat=~y, color=~leaflet_pal(species_type),
                   radius=3, opacity = 1, fillOpacity=1, label=~species)
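
The species_type column above is built with three separate assignments. As an alternative sketch (not the method used in this workshop), the same grouping could be written in one step with dplyr's case_when function. The data frame `toy` below is made up for illustration. Any species not listed would be left as NA.

```r
library(dplyr)

# Made-up data for illustration only (not the workshop data)
toy <- data.frame(
  species = c("dog", "lion", "human", "cat", "jackal")
)

# Conditions are checked in order; the first one that is TRUE sets the value
toy <- toy %>%
  mutate(species_type = case_when(
    species %in% c("dog", "cat")     ~ "Domestic",
    species %in% c("jackal", "lion") ~ "Wildlife",
    species == "human"               ~ "Human"
  ))

toy
```

Keeping all of the categories in one call makes it easier to see at a glance which species belong to which group.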